Abstract


We propose several parallel constructs for circular correlation that are suitable for use in GPS receivers.
The potential increase in performance is 2-fold or more when targeting an FPGA architecture. We have
preliminary Simulink models that yield the correct result for the proposed parallel architectures.

We developed and validated, using field-recorded GPS satellite data, a MATLAB implementation of the
MIT Quicksynch algorithm for fast circular correlation, which exploits the sparsity of the circular
correlation output. Simulink models for this algorithm are in progress. As a spinoff, one of our student
research assistants proposed, and was selected, to implement the algorithm on an open source
platform under the auspices of Google Summer of Code 2014.

Since  the  DFT  can  be  written  in  the  form  of  a  cyclic  convolution,  we  used  our  sectioned  cyclic  convolution 
algorithm  for  parallel  implementations  of  a  one-dimensional  DFT.  Such  implementations  are  slower  than 
serial  FFT  implementations  but  can  access  the  larger  memory  available  in  clusters.  Although  not  as  efficient 
as  the  traditional  row-column  implementation,  and  also  length-constrained,  our  method  was  able  to  tackle 
signal  lengths  that  were  not  suitable  for  the  row-column  algorithm. 

Time domain-based implementations of cyclic convolution are dramatically slower but, unlike frequency
domain-based algorithms, give the exact integer output when the input is a sequence of integers. For
length-2^23 signals we obtained speedups of up to 418 for the 64-core parallel-recursive implementation
versus the 1-core serial O(N^2) implementation, and speedups of up to 17 for the parallel-recursive versus
the 1-core serial-recursive implementation.


Objectives 

Original  Project  Objectives 

a)  Find  out  the  relative  performance  of  the  newly  developed  parallel  cyclic  convolution  algorithms 
described  in  our  proposal  to  compute  very  long  length  cyclic  convolutions  [1],  [2],  [5],  [15]. 
(Completed)

b) Establish performance benchmarks for the newly introduced architectures for parallel
computation of long-length, one-dimensional cyclic convolution within different cluster and
multi-core configurations [4], [11], [12]. (Completed)

c) Identify the limiting computational factors (length, memory, inter-processor communication and
the like). As the problem length increases, monitor which factor becomes the computational
bottleneck [1]. (Ongoing)

d) Introduce, at the HPC laboratory, the use of MATLAB, the Parallel Computing Toolbox and the
MATLAB Distributed Computing Server. These software packages were selected to provide a
rapid prototyping and entry-level research environment. Implementations in C++ will
follow [1], [2], [5], [13]. (Completed)

e) Give a proof for two operator-oriented formulas suitable to unfold the data flow graph of cyclic
and non-cyclic (linear) shifts. These formulas were developed and published under a previous ARO
grant [16]. (Completed)

f) Study and formulate parallel one-dimensional Discrete Fourier Transform implementations based
on the proposed parallel cyclic convolution architectures. Compare and benchmark against
traditional approaches [4], [11], [12]. (Completed)

g) Increase the number of minority students who are involved in, or are aware of, issues relating to
DSP algorithms and parallel processing. (Ongoing)


Future  Work  Objectives 


a) All of our structures for parallel cyclic convolution can be applied to parallel circular correlation.
Circular correlation is used, among other applications, in GPS receivers. We are proposing several
structures that complement the ones presented in 2012 by a group from the Ecole Polytechnique
(Lausanne), several of which are essentially the same as those we had originally proposed. These
parallel algorithms have the potential for a 2-fold or greater increase in performance [8], [9], [10].
(Ongoing Work)

b) We are involved in the implementation and validation of the MIT-developed Quicksynch algorithm
for fast circular correlation in GPS SDR systems. The algorithm is based on the sparsity of the
correlation output. We developed MATLAB code validated with actual GPS satellite signal
records. Simulink models are in progress. As a byproduct of our project, one of our student research
assistants proposed, and was selected, to implement this same algorithm on an open source
platform under the auspices of the Google Summer of Code 2014 program [8], [9], [10]. (Ongoing
Work)


Significance 


Large Data Sets: Today's technology demands the processing of ever larger data sets. One particular
problem  when  processing  large  signals  is  to  split  a  large  processing  operation,  such  as  cyclic 
convolution,  into  smaller  subtasks  in  order  to  access  the  much  larger  memory  available  in  clusters. 
Our  sectioned  cyclic  convolution  algorithm  can  provide  parallel  frequency  domain-based 
implementations  as  well  as  parallel  time  domain-based  implementations  of  cyclic  convolution. 
Time  domain-based  implementations  can  be  further  accelerated  by  using  our  serial-recursive 
approach  in  the  parallel  subsections.  Parallel  implementations  of  FFT-based  cyclic  convolution  are 
likely  to  be  slower  than  the  serial  FFT-based  implementation  but  have  the  potential  to  access  the 
much larger memory available in clusters. Parallel implementations of cyclic convolution using
time domain-based algorithms, which guarantee the exact integer result when performing purely
integer convolution, are likely to be faster than their serial time domain-based counterparts.
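
The idea of splitting one long cyclic convolution into independent subtasks can be illustrated with a minimal Python sketch. It uses the simplest possible sectioning (by linearity, the filter taps are split into blocks and the partial outputs are added) and is only an assumption-level illustration, not the sectioned structure developed in this project. The function names are hypothetical.

```python
def cyclic_convolve(x, h):
    """Direct O(N^2) cyclic convolution; exact for integer inputs."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]


def section_contribution(x, h, start, stop):
    """Partial output produced by the taps h[start:stop] only (one subtask)."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(start, stop))
            for i in range(n)]


def sectioned_cyclic_convolve(x, h, sections):
    """Split h into blocks, convolve each block separately, add the results."""
    n = len(x)
    bounds = [n * s // sections for s in range(sections + 1)]
    # Each partial result is independent, so it could run on its own worker.
    partials = [section_contribution(x, h, bounds[s], bounds[s + 1])
                for s in range(sections)]
    return [sum(column) for column in zip(*partials)]


x = [3, -1, 4, 1, -5, 9, 2, -6]
h = [2, 7, -1, 8, 2, -8, 1, 8]
print(sectioned_cyclic_convolve(x, h, 4) == cyclic_convolve(x, h))
```

Because only integer additions and multiplications are involved, the sectioned result matches the direct result exactly, whatever the number of sections.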

Round-off Errors in Frequency Domain-Based, Purely Integer, Cyclic Convolution: Another
significant problem, especially with very long sequences, is that floating-point errors
affect FFTs to the point that an FFT-based cyclic convolution of integer
sequences may not give the exact integer results. This is a problem in applications such as
multiplication of large integers, computer algebra packages, computational number theory and
others. Our parallel, or serial-recursive, sectioned implementation of cyclic convolution can use
time domain-based algorithms for the parallel, or serial-recursive, sub-convolution stages and
therefore guarantees the exact integer output when performing purely integer circular
convolution. If such time domain-based algorithms are sufficiently accelerated, they may be
considered as an alternative for lengths where other error-avoiding fast cyclic convolution
techniques, such as those using Number Theoretic Transforms, are not applicable or are slower.
Furthermore, if our techniques are used for a frequency domain-based implementation of cyclic
convolution, the round-off errors are reduced because of the shorter sequences in the subsections.
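
The contrast between the two families of algorithms can be seen in a small Python sketch. It compares a textbook radix-2 FFT-based cyclic convolution, which returns floating-point values that only approximate the integers, with the direct time-domain sum, which stays in integer arithmetic. This is an illustrative toy under our own assumptions (function names, test data), not the benchmarked C++ or MATLAB code.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two); unnormalized."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def cyclic_convolve_fft(x, h):
    """Frequency-domain cyclic convolution; the result is floating point."""
    n = len(x)
    spectrum = [a * b for a, b in zip(fft(x), fft(h))]
    return [v.real / n for v in fft(spectrum, inverse=True)]


def cyclic_convolve_direct(x, h):
    """Time-domain cyclic convolution; exact for integer inputs."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]


x = [10**6 + i for i in range(16)]
h = [10**6 - 3 * i for i in range(16)]
exact = cyclic_convolve_direct(x, h)
approx = cyclic_convolve_fft(x, h)
worst = max(abs(a - e) for a, e in zip(approx, exact))
print("largest deviation from the exact integer result:", worst)
```

For short sequences the deviation is small enough that rounding recovers the integers; the concern described above is that the error grows with sequence length and magnitude until rounding is no longer safe.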

Parallel implementation of one-dimensional DFTs: The most common technique used to
implement a one-dimensional DFT in clusters or distributed environments has certain length
constraints. We are proposing a new technique, which is not as efficient but can tackle certain signal
lengths that are not suitable for the standard row-column approach.

Applications to Parallel Circular Correlation (New Research Direction): Circular correlation is a
widely used operation, most conspicuously in GPS receivers. Parallelization can result in a 2-fold
or greater increase in performance in FPGA implementations. It turns out that our structures for
cyclic convolution are readily applicable to this task. We intend to propose several structures that
complement the ones presented in 2012 by a group from the Ecole Polytechnique (Lausanne).
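
The reason cyclic convolution structures carry over to circular correlation is that, for real sequences, correlating against a code is the same as convolving with the index-reversed code. The Python sketch below shows this identity on a 7-chip +/-1 sequence used as a stand-in for a GPS code (an assumption for illustration only); the correlation peak appears at the simulated delay.

```python
def cyclic_convolve(x, h):
    """Direct cyclic convolution."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]


def circular_correlate(x, c):
    """r[m] = sum_n x[n] * c[(n - m) mod N], for real-valued sequences."""
    n = len(x)
    reversed_c = [c[(-k) % n] for k in range(n)]  # c[-k mod N]
    return cyclic_convolve(x, reversed_c)


code = [1, 1, 1, -1, -1, 1, -1]           # 7-chip stand-in code
delay = 3
received = [code[(i - delay) % 7] for i in range(7)]
r = circular_correlate(received, code)
print(r)                                   # [-1, -1, -1, 7, -1, -1, -1]
print(r.index(max(r)))                     # 3
```

Any structure that parallelizes `cyclic_convolve` therefore parallelizes `circular_correlate` as well, once the code sequence is index-reversed.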


Accomplishments 

Theoretical  Work,  Code  Development  and  Benchmarking 

Parallel One-Dimensional DFT: Developed a novel algorithm for parallel implementation of a
one-dimensional Discrete Fourier Transform. This algorithm can tackle signal lengths that are
complementary to those amenable to an implementation using the more efficient "row-column"
approach (standard method). Parallel implementations of Fast Fourier Transforms (FFTs) are in
general slower than their serial counterparts, but they allow the processing of larger sequences [4],
[11], [12]. (To be submitted for publication)
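
As noted in the abstract, the approach relies on writing the DFT as a cyclic convolution. The report does not state which formulation was used; one well-known option for prime lengths is Rader's algorithm, sketched below in Python as an assumption-level illustration. It turns a length-N DFT (N prime) into a single cyclic convolution of length N - 1, which is the kind of operation a parallel cyclic convolution structure could then take over. The primitive root `g` is supplied by the caller.

```python
import cmath


def dft_direct(x):
    """Reference O(N^2) DFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def cyclic_convolve(a, b):
    """Direct cyclic convolution."""
    n = len(a)
    return [sum(a[q] * b[(p - q) % n] for q in range(n)) for p in range(n)]


def dft_rader(x, g):
    """DFT of prime length N via one cyclic convolution of length N - 1.

    g must be a primitive root modulo N.
    """
    n = len(x)
    g_inv = pow(g, n - 2, n)  # modular inverse of g (valid because N is prime)
    a = [x[pow(g, q, n)] for q in range(n - 1)]      # input permuted by g^q
    b = [cmath.exp(-2j * cmath.pi * pow(g_inv, q, n) / n)
         for q in range(n - 1)]                      # kernel W^(g^-q)
    conv = cyclic_convolve(a, b)
    out = [0j] * n
    out[0] = sum(x)
    for p in range(n - 1):
        out[pow(g_inv, p, n)] = x[0] + conv[p]       # output index g^-p
    return out


x = [1, 2, 3, 4, 5, 6, 7]
error = max(abs(u - v) for u, v in zip(dft_rader(x, 3), dft_direct(x)))
print(error < 1e-9)
```

Here 3 is a primitive root modulo 7. The restriction to lengths for which such a reformulation exists is one example of the kind of length constraint mentioned above.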

Reduction  of  Round-off  Errors  in  Frequency  Domain-Based  Cyclic  Convolution:  Using  our  cyclic 
convolution  structures  we  were  able  to  lower,  or  eliminate,  round-off  errors  when  using  Fast 
Fourier  Transforms  (FFTs)  to  compute  cyclic  convolution  of  very  large  sequences.  C++  and 
MATLAB implementations were benchmarked [2], [5]. (Published)

Time Domain-Based Parallel Cyclic Convolution: We implemented and benchmarked parallel
cyclic convolution algorithms based on our novel structures. These implementations substantially
accelerated time domain-based cyclic convolution algorithms. FFT-based
implementations of cyclic convolution cannot be accelerated using this method. Time
domain-based cyclic convolution algorithms are slower than FFT-based algorithms but guarantee the
exact integer result for purely integer cyclic convolution, which is important in certain areas such as
multiplication of large integers, computer algebra, cryptology, computational number theory,
experimental mathematics and others [1], [2], [5], [15]. (To be submitted for publication)

Mixed Mode MPI-OpenMP versus MPI Implementations: We benchmarked a mixed-mode
MPI-OpenMP C++ implementation of parallel cyclic convolution versus a direct MPI implementation.
In this case study, even though we matched the algorithm structure in the mixed-mode approach to
the multicore node architecture of the cluster, the direct MPI approach proved to be slightly better
[1], [5]. (Published)

New Research Direction (Parallel Correlators for GPS Receivers): All of our structures
for parallel cyclic convolution can be applied to parallel circular correlation. Circular correlation is
used, among other applications, in GPS receivers. We developed parallel algorithms that are
complementary to the ones presented in 2012 by a group from the Ecole Polytechnique, Lausanne
(several of which are the same as those we originally proposed). Specifically, we can propose a variant
that avoids the use of numerically controlled oscillators, which they use in their constructs, and we
will propose further complementary architectures. We have preliminary Simulink models validated
with modeled data, and we are developing Simulink models that will work with field-recorded GPS
signal data [8], [9], [10]. (To be submitted for publication)

Further Code Development (GPS Correlators): We developed and validated MATLAB code and
Simulink models and libraries that implement GPS correlators using a standard baseline GPS
algorithm. These serve as a testbed and as a reference for comparing the accuracy and performance
of the parallel correlator constructs that we are proposing. We also wrote a MATLAB
implementation of an algorithm developed at MIT (Quicksynch) for GPS SDR receivers, based on
the sparse nature of the correlator output [8], [9], [10].
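
To make the role of sparsity concrete, the following Python sketch shows the folding (aliasing) idea commonly described for this algorithm, as we understand it: because the correlation output has a single dominant peak, both signal and code can be folded into p buckets, a short length-p correlation finds the delay modulo p, and only the N/p surviving candidates are checked at full length. This is a noise-free simplification, not our validated MATLAB code: it uses direct sums where an FFT of length p would be used, a maximal-length sequence from the recurrence s[n] = s[n-3] XOR s[n-10] as a stand-in for a real C/A code, and hypothetical function names.

```python
def m_sequence(taps=(3, 10)):
    """+/-1 maximal-length sequence from s[n] = s[n-3] XOR s[n-10]."""
    short, long_ = taps
    bits = [1] * long_
    period = 2 ** long_ - 1
    while len(bits) < period:
        bits.append(bits[-short] ^ bits[-long_])
    return [1 - 2 * b for b in bits]


def fold(seq, p):
    """Alias a length-N sequence into p buckets by summing its N/p blocks."""
    return [sum(seq[i::p]) for i in range(p)]


def correlate_at(x, c, m):
    """Circular correlation of x against c at the single lag m."""
    n = len(x)
    return sum(x[i] * c[(i - m) % n] for i in range(n))


def estimate_delay(x, c, p):
    """Find the code delay using a folded correlation plus N/p direct checks."""
    n = len(x)
    assert n % p == 0, "this sketch requires p to divide N"
    xf, cf = fold(x, p), fold(c, p)
    folded = [correlate_at(xf, cf, m) for m in range(p)]  # length-p correlation
    bucket = max(range(p), key=lambda m: folded[m])       # delay modulo p
    candidates = range(bucket, n, p)                      # N/p possible delays
    return max(candidates, key=lambda m: correlate_at(x, c, m))


code = m_sequence()                      # length 1023
true_delay = 700
received = [code[(i - true_delay) % len(code)] for i in range(len(code))]
print(estimate_delay(received, code, 33))
```

With N = 1023 and p = 33, the search computes a 33-point correlation and 31 full-length lags instead of all 1023 lags. With noisy data the bucket choice becomes a statistical decision, which this sketch does not address.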

Improved Digit Reversal Routine for MATLAB: As part of the code development effort, we were
able to generate fast, efficient stride permutation routines in MATLAB. We used these routines to
implement a digit reversal that, for large sequences, runs faster than the native
MATLAB digit reversal command. We can also tackle larger sequence lengths and larger radixes
than those allowed by the native MATLAB implementation [14]. (To be submitted to a student
conference)
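
The link between stride permutations and digit reversal can be sketched in a few lines of Python. The sketch assumes the standard recursive construction (apply a stride permutation by the radix, then digit-reverse each resulting block); it illustrates the principle only and is not the MATLAB routine described above.

```python
def stride_permute(x, s):
    """Stride-s permutation: gather x[0::s], then x[1::s], ..., x[s-1::s]."""
    return [v for d in range(s) for v in x[d::s]]


def digit_reverse(x, radix):
    """Reorder x (length radix**t) so that base-`radix` index digits are reversed."""
    n = len(x)
    if n <= radix:
        return list(x)
    block = n // radix
    shuffled = stride_permute(x, radix)
    out = []
    for d in range(radix):
        # Block d holds the elements whose least significant digit is d.
        out.extend(digit_reverse(shuffled[d * block:(d + 1) * block], radix))
    return out


print(digit_reverse(list(range(8)), 2))   # [0, 4, 2, 6, 1, 5, 3, 7]
print(digit_reverse(list(range(9)), 3))   # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

With radix 2 this reduces to the familiar bit-reversal ordering used by radix-2 FFTs.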

Further  Theoretical  Work:  Development  of  an  induction  proof  for  two  operator-oriented  formulas 
suitable  to  unfold  data  flow  graphs  of  cyclic  and  linear  shifts.  This  proof  supports  the  correctness 
of  such  formulas,  developed  and  published  while  working  in  a  previous  ARO  project  [16]. 

Maintenance and Upgrade of our Hardware/Software Testbed


Hardware (Cluster): Updated and made operational a 64-processor, 16-node Dell cluster purchased
under  a  previous  ARO  grant.  The  cluster  is  dated  but  adequate  for  benchmarking  and  scalability 
studies.  An  operations  manual  was  developed  [13]. 

Hardware  (GPU):  The  cluster  was  further  upgraded  by  installing  four  additional  nodes  with  GPU 
processors.  This  particular  architecture  will  be  used  for  further  research  [13]. 

Software  (MATLAB,  C++):  A  64-worker  MATLAB  Distributed  Computing  Server  was  purchased 
and  installed  in  the  cluster.  We  can  also  use  C++  in  the  cluster  [13]. 

Software (MATLAB): The MATLAB Parallel Computing Toolbox was installed on two multi-core
Dell servers (4 and 12 processors), the cluster master node, and 3 multi-core laptops. These computers
are to be used as MATLAB clients for the cluster and as multi-core testing platforms. We are
gaining expertise in the use of MATLAB for parallel DSP algorithm development.


Collaborations 


We are collaborating, within our own institution, with a separately funded group of
researchers, to whom we are making available our expertise, hardware resources and parallel
processing capabilities using MATLAB. The research is a subset of a larger project in our Plasma
Physics Lab, funded by NRC, and is related to the development of a model to study electric particle
containment within the plasma chamber.


Conclusions 


Time Domain-Based Parallel Cyclic Convolution: Our benchmarks show that our
parallel implementation of cyclic convolution, combined with our serial-recursive implementation,
is suitable for accelerating time domain-based cyclic convolution algorithms. The serial-recursive
implementation on its own can also accelerate time domain-based cyclic convolution
implementations. Time domain-based implementations guarantee the exact integer result when
performing purely integer convolution, which is important in certain areas such as multiplication
of large integers, computer algebra, cryptology, computational number theory, experimental
mathematics and others.

Reduction of Round-off Errors in Frequency Domain-Based Cyclic Convolution: Frequency
domain-based cyclic convolution algorithms, as expected, cannot be accelerated using our
parallel, recursive or combined approach. If the approach is used anyway, we found that the
round-off errors introduced by the FFTs are reduced (because of the shorter sequences in the
subsections) in exchange for a reasonable decline in overall performance. This could be important
in applications where large round-off errors in purely integer circular convolution are not acceptable.

Parallel One-Dimensional DFT: We developed a novel algorithm for the parallel
implementation of a 1-D DFT in cluster architectures or distributed environments. We coded and
benchmarked this algorithm. Although not as efficient as the established technique, it expands the
range of signal lengths that can be tackled. It can therefore be considered complementary to the
traditional procedure that maps a 1-D DFT to a 2-D DFT followed by a row-column decomposition.
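
For reference, the traditional row-column procedure mentioned above can be sketched as follows in Python. The sketch assumes the standard index mapping for N = n1 * n2 (short DFTs along one dimension, twiddle factors, short DFTs along the other dimension) and uses direct DFTs for the short transforms to stay self-contained. It shows where the length constraint comes from: N must factor into sizes that suit the available nodes.

```python
import cmath


def dft_direct(x):
    """Reference O(N^2) DFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def dft_row_column(x, n1, n2):
    """Length n1*n2 DFT via n2 DFTs of length n1, twiddles, n1 DFTs of length n2."""
    n = n1 * n2
    assert len(x) == n
    # Column j2 holds x[n2*j1 + j2] for j1 = 0..n1-1 (a stride-n2 subsequence).
    cols = [dft_direct(x[j2::n2]) for j2 in range(n2)]       # cols[j2][k1]
    out = [0j] * n
    for k1 in range(n1):
        row = [cols[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
               for j2 in range(n2)]                           # twiddle factors
        row_dft = dft_direct(row)                             # indexed by k2
        for k2 in range(n2):
            out[k1 + n1 * k2] = row_dft[k2]
    return out


x = [complex(i * i % 7, i % 3) for i in range(12)]
error = max(abs(u - v) for u, v in zip(dft_row_column(x, 3, 4), dft_direct(x)))
print(error < 1e-9)
```

The column transforms are independent of one another, as are the row transforms, which is what makes this mapping attractive on clusters; the data exchange between the two stages is the usual communication cost.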

New Research Direction (Parallel Correlators for GPS Receivers): All of our structures
for parallel cyclic convolution can be applied to parallel circular correlation. Circular correlation is
used, among other applications, in GPS receivers. We developed parallel algorithms that are
complementary to the ones presented in 2012 by a group from the Ecole Polytechnique, Lausanne
(several of which are the same as those we originally proposed). Specifically, we can propose a variant
that avoids the use of numerically controlled oscillators, which they use in their constructs, and we
can provide further complementary architectures.

Further Code Development (GPS Correlators): We developed MATLAB code and Simulink
models and libraries to implement GPS correlators. These are to be used as a testbed to validate the
parallel correlator constructs that we are proposing. We also wrote a MATLAB implementation of an
algorithm developed at MIT for GPS software-defined radio (SDR) receivers, based on the sparse
nature of the correlator output.

Presentation and Upgrade of our Hardware Assets: We updated a cluster from a previous DoD
grant and installed the MATLAB Distributed Computing Server. In addition, we added four GPU
nodes to this cluster for future research. The cluster itself is dated but useful for benchmarking,
scalability studies and educational purposes. We also designated several multi-core computers for
benchmarking and research using the MATLAB Parallel Computing Toolbox.

Students: We have had several graduate and undergraduate research assistants who have gained
valuable experience. Two of our undergraduate research assistants were awarded summer
internships at NIST, and one was selected for Google Summer of Code to
develop and expand, on an open source platform, the same GPS algorithms that we implemented
in MATLAB and Simulink. All graduates are working or pursuing graduate studies.


Future  Work 


Complete  our  research  regarding  the  development  of  novel  parallel  circular  correlators  with 
applications  to  GPS  receivers  and  other  potential  uses. 

Continue  working  on  the  development  of  MATLAB  code,  Simulink  models  and  libraries  to  test 
our  proposed  circular  correlation  constructs  with  application  to  SDR  GPS  receivers. 

Assess the viability of implementing the MIT-developed circular correlator, which is based on the
sparse characteristics of the correlator output, in combination with our parallel circular correlator
structures, possibly using FPGA co-processors, for added performance.

Use  our  cluster  architecture  that  includes  4  nodes  with  GPUs  for  further  experimentation. 

Continue involving students and interacting with faculty to promote algorithm development and
parallel processing at our institution.